The goal of this project is to compare the performance of two types of models on a given dataset. In particular, we will compare Logistic Regression (implemented by sklearn.linear_model.LogisticRegression) with a Random Forest (implemented by sklearn.ensemble.RandomForestClassifier). We will also optimize both algorithms' hyperparameters and determine which model is best suited to this dataset.
Let's import some libraries that will be useful in the project.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
We first want to better understand the structure of the data we are working with. This is an important step in our analysis, since we want to identify and deal with potentially problematic aspects of the data and clean it. Moreover, by understanding its properties, such as the balance between classes, we can configure the classifiers correctly and choose the right metrics to evaluate their performance.
#import the dataset
df = pd.read_csv("./data/mldata_0003164501.csv", sep = ",")
df
| Unnamed: 0 | label | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | ... | feature_22 | feature_23 | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | categorical_feature_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0.977711 | 0.506776 | 0.274184 | 3.146263 | 0.970143 | -0.331920 | -1.369330 | -0.184021 | ... | -1.167407 | 0.740888 | -2.823582 | 1.937094 | -1.095149 | 0.707887 | -1.530358 | -0.044247 | -2.446235 | B |
| 1 | 1 | 1 | 0.047318 | -4.237021 | 0.232601 | -0.634119 | -1.314572 | 1.993623 | 1.330799 | 1.445451 | ... | -0.857609 | 0.153576 | 0.582376 | 0.808039 | 0.048286 | 0.107600 | -0.083177 | -0.093708 | -2.069849 | A |
| 2 | 2 | 2 | 1.538534 | -1.704886 | -0.926433 | 0.081837 | -2.230265 | 0.286980 | -1.556409 | 0.661266 | ... | 0.516624 | -1.141832 | -2.591078 | 1.304137 | -0.385021 | 1.064548 | -0.192902 | -1.009055 | -0.084599 | B |
| 3 | 3 | 2 | 0.677418 | 4.760305 | 3.778898 | 2.630025 | -2.763774 | -0.097002 | 4.955849 | -0.854142 | ... | -0.852140 | 1.343323 | 3.636170 | 0.677988 | 0.212772 | -0.677034 | 0.639229 | -0.675410 | 1.036157 | B |
| 4 | 4 | 1 | 1.692027 | 0.076915 | -1.736087 | 1.497523 | -0.405362 | -0.184517 | -0.707944 | 0.550167 | ... | 0.680531 | 1.500405 | -0.675614 | 0.234405 | 0.736274 | 0.146277 | 1.758071 | 1.209597 | -0.707836 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1295 | 1295 | 0 | -1.106235 | 0.192021 | -3.044139 | -0.413591 | 1.622252 | -0.254356 | -0.101595 | -0.745178 | ... | -1.303725 | 0.181726 | 2.327401 | 2.433295 | 0.574041 | -3.224432 | -0.788689 | 0.533658 | -3.579361 | C |
| 1296 | 1296 | 2 | -0.546425 | -0.658874 | -1.400140 | -3.119142 | -2.378495 | -0.324184 | 1.249406 | 1.060721 | ... | 0.465890 | -1.498137 | 1.581889 | -1.957156 | -0.931723 | 0.218409 | -0.431815 | -1.814201 | 2.575537 | B |
| 1297 | 1297 | 3 | 0.866665 | 2.464064 | 3.617311 | 2.733195 | 0.794086 | -0.541519 | -3.339625 | 2.101668 | ... | 0.553514 | -0.616828 | -4.677591 | -1.403826 | 0.049761 | 0.119145 | 0.375289 | 0.681518 | 0.695855 | B |
| 1298 | 1298 | 0 | -0.326577 | 1.397554 | -1.358974 | -4.921358 | -0.640626 | -0.519134 | 2.351460 | -0.986060 | ... | -1.525018 | -0.400171 | 2.559972 | -3.349548 | 0.627411 | -0.287641 | -0.561310 | 0.642402 | -0.071863 | C |
| 1299 | 1299 | 1 | -0.486092 | 2.158804 | 3.344615 | 3.395124 | 0.099400 | 0.021686 | 1.859910 | -0.916593 | ... | -2.267329 | -0.665686 | -0.163075 | -4.088241 | -1.019011 | 1.508333 | -0.985281 | 1.070124 | -1.429840 | A |
1300 rows × 33 columns
#shape of the dataset
df.shape
(1300, 33)
We are given a dataset whose first column is the sample id and whose second column is the label, followed by 31 features, of which the last is categorical. We have 1300 samples. First of all, we set the Unnamed: 0 column as the index of the dataframe.
#set the unnamed column as the index
df.set_index('Unnamed: 0', drop = True, inplace = True)
df
| label | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | ... | feature_22 | feature_23 | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | categorical_feature_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | |||||||||||||||||||||
| 0 | 3 | 0.977711 | 0.506776 | 0.274184 | 3.146263 | 0.970143 | -0.331920 | -1.369330 | -0.184021 | -0.184244 | ... | -1.167407 | 0.740888 | -2.823582 | 1.937094 | -1.095149 | 0.707887 | -1.530358 | -0.044247 | -2.446235 | B |
| 1 | 1 | 0.047318 | -4.237021 | 0.232601 | -0.634119 | -1.314572 | 1.993623 | 1.330799 | 1.445451 | -0.086046 | ... | -0.857609 | 0.153576 | 0.582376 | 0.808039 | 0.048286 | 0.107600 | -0.083177 | -0.093708 | -2.069849 | A |
| 2 | 2 | 1.538534 | -1.704886 | -0.926433 | 0.081837 | -2.230265 | 0.286980 | -1.556409 | 0.661266 | 2.436296 | ... | 0.516624 | -1.141832 | -2.591078 | 1.304137 | -0.385021 | 1.064548 | -0.192902 | -1.009055 | -0.084599 | B |
| 3 | 2 | 0.677418 | 4.760305 | 3.778898 | 2.630025 | -2.763774 | -0.097002 | 4.955849 | -0.854142 | -1.733114 | ... | -0.852140 | 1.343323 | 3.636170 | 0.677988 | 0.212772 | -0.677034 | 0.639229 | -0.675410 | 1.036157 | B |
| 4 | 1 | 1.692027 | 0.076915 | -1.736087 | 1.497523 | -0.405362 | -0.184517 | -0.707944 | 0.550167 | -0.027586 | ... | 0.680531 | 1.500405 | -0.675614 | 0.234405 | 0.736274 | 0.146277 | 1.758071 | 1.209597 | -0.707836 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1295 | 0 | -1.106235 | 0.192021 | -3.044139 | -0.413591 | 1.622252 | -0.254356 | -0.101595 | -0.745178 | -2.724777 | ... | -1.303725 | 0.181726 | 2.327401 | 2.433295 | 0.574041 | -3.224432 | -0.788689 | 0.533658 | -3.579361 | C |
| 1296 | 2 | -0.546425 | -0.658874 | -1.400140 | -3.119142 | -2.378495 | -0.324184 | 1.249406 | 1.060721 | -0.587969 | ... | 0.465890 | -1.498137 | 1.581889 | -1.957156 | -0.931723 | 0.218409 | -0.431815 | -1.814201 | 2.575537 | B |
| 1297 | 3 | 0.866665 | 2.464064 | 3.617311 | 2.733195 | 0.794086 | -0.541519 | -3.339625 | 2.101668 | 0.684216 | ... | 0.553514 | -0.616828 | -4.677591 | -1.403826 | 0.049761 | 0.119145 | 0.375289 | 0.681518 | 0.695855 | B |
| 1298 | 0 | -0.326577 | 1.397554 | -1.358974 | -4.921358 | -0.640626 | -0.519134 | 2.351460 | -0.986060 | -0.976389 | ... | -1.525018 | -0.400171 | 2.559972 | -3.349548 | 0.627411 | -0.287641 | -0.561310 | 0.642402 | -0.071863 | C |
| 1299 | 1 | -0.486092 | 2.158804 | 3.344615 | 3.395124 | 0.099400 | 0.021686 | 1.859910 | -0.916593 | -1.786863 | ... | -2.267329 | -0.665686 | -0.163075 | -4.088241 | -1.019011 | 1.508333 | -0.985281 | 1.070124 | -1.429840 | A |
1300 rows × 32 columns
Now let's deal with the categorical data. The last feature is a categorical variable, so we will use a one-hot encoding. We chose this approach rather than an integer encoding because it makes no assumption about an ordering between categories. Moreover, since the feature has only 3 distinct values, one-hot encoding will not increase the dimensionality of the dataframe much.
#apply one hot encoding to the categorical variable
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(df[['categorical_feature_1']]).toarray())
encoder_df.columns = encoder.categories_[0]  #['A', 'B', 'C'], since OneHotEncoder sorts the categories
df = df.join(encoder_df)
df.drop('categorical_feature_1', axis = 1, inplace = True)
df
| label | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | ... | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | A | B | C | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | |||||||||||||||||||||
| 0 | 3 | 0.977711 | 0.506776 | 0.274184 | 3.146263 | 0.970143 | -0.331920 | -1.369330 | -0.184021 | -0.184244 | ... | -2.823582 | 1.937094 | -1.095149 | 0.707887 | -1.530358 | -0.044247 | -2.446235 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 0.047318 | -4.237021 | 0.232601 | -0.634119 | -1.314572 | 1.993623 | 1.330799 | 1.445451 | -0.086046 | ... | 0.582376 | 0.808039 | 0.048286 | 0.107600 | -0.083177 | -0.093708 | -2.069849 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 1.538534 | -1.704886 | -0.926433 | 0.081837 | -2.230265 | 0.286980 | -1.556409 | 0.661266 | 2.436296 | ... | -2.591078 | 1.304137 | -0.385021 | 1.064548 | -0.192902 | -1.009055 | -0.084599 | 0.0 | 1.0 | 0.0 |
| 3 | 2 | 0.677418 | 4.760305 | 3.778898 | 2.630025 | -2.763774 | -0.097002 | 4.955849 | -0.854142 | -1.733114 | ... | 3.636170 | 0.677988 | 0.212772 | -0.677034 | 0.639229 | -0.675410 | 1.036157 | 0.0 | 1.0 | 0.0 |
| 4 | 1 | 1.692027 | 0.076915 | -1.736087 | 1.497523 | -0.405362 | -0.184517 | -0.707944 | 0.550167 | -0.027586 | ... | -0.675614 | 0.234405 | 0.736274 | 0.146277 | 1.758071 | 1.209597 | -0.707836 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1295 | 0 | -1.106235 | 0.192021 | -3.044139 | -0.413591 | 1.622252 | -0.254356 | -0.101595 | -0.745178 | -2.724777 | ... | 2.327401 | 2.433295 | 0.574041 | -3.224432 | -0.788689 | 0.533658 | -3.579361 | 0.0 | 0.0 | 1.0 |
| 1296 | 2 | -0.546425 | -0.658874 | -1.400140 | -3.119142 | -2.378495 | -0.324184 | 1.249406 | 1.060721 | -0.587969 | ... | 1.581889 | -1.957156 | -0.931723 | 0.218409 | -0.431815 | -1.814201 | 2.575537 | 0.0 | 1.0 | 0.0 |
| 1297 | 3 | 0.866665 | 2.464064 | 3.617311 | 2.733195 | 0.794086 | -0.541519 | -3.339625 | 2.101668 | 0.684216 | ... | -4.677591 | -1.403826 | 0.049761 | 0.119145 | 0.375289 | 0.681518 | 0.695855 | 0.0 | 1.0 | 0.0 |
| 1298 | 0 | -0.326577 | 1.397554 | -1.358974 | -4.921358 | -0.640626 | -0.519134 | 2.351460 | -0.986060 | -0.976389 | ... | 2.559972 | -3.349548 | 0.627411 | -0.287641 | -0.561310 | 0.642402 | -0.071863 | 0.0 | 0.0 | 1.0 |
| 1299 | 1 | -0.486092 | 2.158804 | 3.344615 | 3.395124 | 0.099400 | 0.021686 | 1.859910 | -0.916593 | -1.786863 | ... | -0.163075 | -4.088241 | -1.019011 | 1.508333 | -0.985281 | 1.070124 | -1.429840 | 1.0 | 0.0 | 0.0 |
1300 rows × 34 columns
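For reference, the same encoding can be obtained in one step with pandas' own `pd.get_dummies`, which also preserves the dataframe index, so no positional join is needed. A minimal sketch on a toy frame (the values below are hypothetical, not taken from our dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"categorical_feature_1": ["B", "A", "C", "A"]})

# pd.get_dummies produces one column per category, named after the
# category, and keeps the original index intact.
dummies = pd.get_dummies(toy["categorical_feature_1"])
print(dummies.columns.tolist())  # ['A', 'B', 'C']
```

This would replace the encoder/join/drop sequence above with a single call; the OneHotEncoder route, on the other hand, gives a fitted transformer that can be reused on unseen data.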
Now that we have finished cleaning the dataset, we want to create a dataframe containing the labels for each sample, and another dataframe containing just the features. This will be useful later on.
#labels and features dataframe
df_label = df['label']
df = df.iloc[:, 1:]
df_T = df.transpose()
Now that we have fixed the structure of the data, we can analyse its properties in more depth.
#informations about the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1300 entries, 0 to 1299
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   feature_1   1300 non-null   float64
 1   feature_2   1300 non-null   float64
 2   feature_3   1300 non-null   float64
 3   feature_4   1300 non-null   float64
 4   feature_5   1300 non-null   float64
 5   feature_6   1300 non-null   float64
 6   feature_7   1300 non-null   float64
 7   feature_8   1300 non-null   float64
 8   feature_9   1300 non-null   float64
 9   feature_10  1300 non-null   float64
 10  feature_11  1300 non-null   float64
 11  feature_12  1300 non-null   float64
 12  feature_13  1300 non-null   float64
 13  feature_14  1300 non-null   float64
 14  feature_15  1300 non-null   float64
 15  feature_16  1300 non-null   float64
 16  feature_17  1300 non-null   float64
 17  feature_18  1300 non-null   float64
 18  feature_19  1300 non-null   float64
 19  feature_20  1300 non-null   float64
 20  feature_21  1300 non-null   float64
 21  feature_22  1300 non-null   float64
 22  feature_23  1300 non-null   float64
 23  feature_24  1300 non-null   float64
 24  feature_25  1300 non-null   float64
 25  feature_26  1300 non-null   float64
 26  feature_27  1300 non-null   float64
 27  feature_28  1300 non-null   float64
 28  feature_29  1300 non-null   float64
 29  feature_30  1300 non-null   float64
 30  A           1300 non-null   float64
 31  B           1300 non-null   float64
 32  C           1300 non-null   float64
dtypes: float64(33)
memory usage: 377.6 KB
#types contained in the dataframe
df.dtypes.unique()
array([dtype('float64')], dtype=object)
#number of null entries
print("Number of NaN entries: ", df.isna().sum().sum())
Number of NaN entries: 0
Notice that the feature dataframe has only numerical features, and that we don't have any missing values.
#unique label types
print("Unique label types: ", df_label.unique())
print(df_label.value_counts())
Unique label types:  [3 1 2 0]
3    343
0    334
2    314
1    309
Name: label, dtype: int64
Notice that we are dealing with a more-or-less balanced dataset. We can also check for duplicate rows and columns:
#find duplicated rows and columns
df_duplicate_rows = df[df.duplicated(keep=False)]
print("Number of duplicate rows: ", df_duplicate_rows.shape[0])
df_duplicate_cols = df_T[df_T.duplicated(keep=False)]
print("Number of duplicate cols: ", df_duplicate_cols.shape[0])
Number of duplicate rows:  0
Number of duplicate cols:  0
Now, let's check whether the data follows a normal distribution, and then visualize the various samples. We first analyse the skewness and the kurtosis of the dataframe. These measures tell us, respectively, how symmetric the data is and how heavy its tails are compared to a normal distribution.
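Before computing these statistics on our data, it helps to calibrate what "low" means. A minimal sketch on a synthetic standard-normal sample (hypothetical data, independent of our dataset):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Synthetic standard-normal sample, used only for calibration.
rng = np.random.default_rng(0)
sample = rng.standard_normal(200_000)

# For a true normal distribution: skewness ~ 0 and, since scipy uses
# Fisher's definition by default (fisher=True), excess kurtosis ~ 0.
print(round(skew(sample), 3))
print(round(kurtosis(sample), 3))
```

Both values come out close to zero, which is the baseline against which we will read the numbers below.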
#skewness and kurtosis of the whole dataframe
from scipy.stats import kurtosis, skew
df_numerical = df_T.loc['feature_1':'feature_30', :].astype(float)  #keep only the 30 numerical features (the label was already dropped)
df_flattened = pd.Series(df_numerical.values.flatten())
skew_df = skew(df_flattened)
kurtosis_df = kurtosis(df_flattened)
print("Skewness of the whole dataframe: ", round(skew_df))
print("Kurtosis of the whole dataframe: ", round(kurtosis_df))
Skewness of the whole dataframe:  0
Kurtosis of the whole dataframe:  2
#skewness for each single sample
df_colnames = list(df_numerical.columns)
colN = df_numerical.shape[1]
df_skew = []
for i in range(colN):
    v_df = df_T.loc['feature_1':'feature_30', df_colnames[i]].astype(float)
    df_skew += [skew(v_df)]
sns.histplot(df_skew, bins=100, kde=True)
plt.show()
#kurtosis for each single sample
df_kurtosis = []
for i in range(colN):
    v_df = df_T.loc['feature_1':'feature_30', df_colnames[i]].astype(float)
    df_kurtosis += [kurtosis(v_df)]
sns.histplot(df_kurtosis, bins=100, kde=True)
plt.show()
In our case, we see low values of both skewness and kurtosis, which suggests the data is close to a normal distribution. In particular, low skewness indicates that the data is approximately symmetric, and low excess kurtosis (scipy's kurtosis is 0 for a normal distribution) means the tails behave roughly like those of a normal distribution.
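If we wanted to back this visual impression with a formal check, scipy's `normaltest` (D'Agostino-Pearson) combines exactly these two statistics. A sketch on synthetic stand-in samples (hypothetical data, not ours):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
normal_sample = rng.standard_normal(5000)   # drawn from a normal
skewed_sample = rng.exponential(size=5000)  # clearly non-normal

# normaltest combines skewness and kurtosis into one statistic;
# a small p-value is evidence against normality.
_, p_normal = normaltest(normal_sample)
_, p_skewed = normaltest(skewed_sample)
print(p_skewed < 0.05)  # the exponential sample is rejected as normal
```

Applied per sample to our dataframe, this would give a p-value for each row, complementing the skewness/kurtosis histograms above.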
Now let's visualize the distribution of the samples.
#distribution of the samples
fig, ax = plt.subplots(1,1, figsize = (5,5))
for i in range(0,5):
    col = df_T[i].loc['feature_1':'feature_30'].astype(float)
    sns.kdeplot(col)
plt.show()
#obtaining random samples
random_columns = df_T.columns.tolist()
np.random.shuffle(random_columns)
# boxplot and violin plot of 50 samples
col = df_T.loc['feature_1':'feature_30', random_columns[0:50]].astype(float)
fig, ax= plt.subplots(2, figsize=(10, 10))
plt.suptitle("Box/violin plot of 50 random samples ")
ax[0].boxplot(col)
ax[1].violinplot(col)
ax[0].set_xlabel("Boxplot")
ax[1].set_xlabel("Violin plot")
ax[0].get_xaxis().set_visible(False)
ax[1].get_xaxis().set_visible(False)
plt.show()
Notice that graphically, too, the data seems to resemble a normal distribution.
Now let's analyse the distribution of the features:
#description of dataframe
df.describe()
| feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | A | B | C | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | ... | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 | 1300.000000 |
| mean | 0.057059 | -0.358120 | -0.353362 | 0.248850 | -0.548936 | 0.113459 | -0.044129 | 0.106316 | 0.130741 | 0.050906 | ... | 0.130333 | -0.388395 | 0.061304 | 0.137229 | 0.086137 | 0.083467 | -0.200111 | 0.346154 | 0.360769 | 0.293077 |
| std | 1.016393 | 2.331683 | 2.327112 | 2.276689 | 2.238338 | 0.997191 | 2.367311 | 0.999200 | 0.995196 | 0.963253 | ... | 2.358335 | 2.388649 | 0.998118 | 1.015407 | 1.033255 | 1.004888 | 2.290431 | 0.475926 | 0.480408 | 0.455349 |
| min | -3.407686 | -6.866555 | -8.196498 | -7.137449 | -6.330767 | -3.207747 | -6.896524 | -3.064150 | -2.930279 | -3.645938 | ... | -7.352388 | -8.961599 | -3.368877 | -3.324403 | -4.199878 | -3.491512 | -9.753380 | 0.000000 | 0.000000 | 0.000000 |
| 25% | -0.628012 | -1.978855 | -1.939604 | -1.274770 | -2.134220 | -0.574609 | -1.742431 | -0.601565 | -0.522857 | -0.573218 | ... | -1.513865 | -1.936061 | -0.615284 | -0.571044 | -0.620587 | -0.572067 | -1.744427 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.046792 | -0.436259 | -0.388752 | 0.300348 | -0.617970 | 0.144132 | -0.059742 | 0.103381 | 0.135093 | 0.095434 | ... | 0.070840 | -0.472139 | 0.061542 | 0.118818 | 0.026834 | 0.139781 | -0.257736 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.756940 | 1.193450 | 1.283804 | 1.830907 | 0.809936 | 0.808414 | 1.587650 | 0.816937 | 0.790001 | 0.682353 | ... | 1.699240 | 1.301831 | 0.751273 | 0.840807 | 0.846782 | 0.748021 | 1.341203 | 1.000000 | 1.000000 | 1.000000 |
| max | 3.086628 | 6.955199 | 9.391982 | 8.481066 | 8.694150 | 3.307266 | 7.817989 | 3.569670 | 3.331400 | 2.914597 | ... | 9.359668 | 7.756389 | 3.098626 | 3.138477 | 3.144868 | 3.224480 | 9.163436 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 33 columns
# boxplot and violin plot of the features
col = df.loc[:, 'feature_1':'feature_30'].astype(float)
fig, ax= plt.subplots(2, figsize=(10, 10))
plt.suptitle("Box/violin plot of the features")
ax[0].boxplot(col)
ax[1].violinplot(col)
ax[0].set_xlabel("Boxplot")
ax[1].set_xlabel("Violin plot")
ax[0].get_xaxis().set_visible(False)
ax[1].get_xaxis().set_visible(False)
plt.show()
#histogram of the features
df.iloc[:, :30].hist(bins=20, figsize=(20,15))
plt.show()
The distribution seems to be normal. We have decided to retain the few outliers, since the data still appears to follow a normal distribution. Now we can also analyse the correlation between features, in order to assess whether some features are related. We are going to use the Pearson correlation coefficient, which measures the linear correlation between pairs of features.
#correlation matrix of the features
corr_matrix_p = df.corr(method = 'pearson')
plt.figure(figsize=(10,5))
sns.heatmap(corr_matrix_p, cmap='coolwarm', yticklabels = False, xticklabels = False)
plt.xlabel('Feature')
plt.ylabel('Feature')
plt.show()
#scatter matrix of the features
from pandas.plotting import scatter_matrix
attributes = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
scatter_matrix(df[attributes], figsize=(12, 8));
The features seem to be mutually uncorrelated.
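As a sanity check on what `df.corr` computes: Pearson's r is the covariance divided by the product of the standard deviations. A quick verification on toy data (hypothetical values, built to be correlated):

```python
import numpy as np

# Toy correlated pair, used only to check the Pearson formula.
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = 0.8 * x + 0.6 * rng.standard_normal(500)

# Pearson r = cov(x, y) / (std(x) * std(y)), with sample (ddof=1) estimates.
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_manual, r_numpy))  # True
```

`pd.DataFrame.corr(method='pearson')` applies this same formula to every pair of columns.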
Now, let's visualize our data, to see how the various classes are distributed in space. We are going to use PCA to project the data into a lower-dimensional space. Recall that PCA is a linear dimensionality reduction technique.
#apply PCA to our data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
sc = StandardScaler()
sc.fit(df)
sc_data = sc.transform(df)
principal = PCA()
principal.fit(sc_data)
x = principal.transform(sc_data)
#plot our data in a 2D and 3D space
from mpl_toolkits.mplot3d import Axes3D
cmap = {0: 'red', 1: 'blue', 2: 'purple', 3: 'green'}
colors = [cmap[lab] for lab in df_label]
fig = plt.figure(figsize=(12, 6))
ax = [None, None]
ax[0] = fig.add_subplot(121)
ax[0].scatter(x[:,0], x[:,1], c=colors)
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[0].grid(True)
handles = [plt.plot([], [], marker="o", ms=10, ls="", mec=None, color=val, label=key)[0] for key, val in cmap.items()]
ax[0].legend(handles=handles, numpoints=1, loc='lower right')
ax[1] = fig.add_subplot(122, projection='3d')
ax[1].scatter(x[:,0], x[:,1], x[:,2], c = colors)
ax[1].set_xlabel("PC1", fontsize=10)
ax[1].set_ylabel("PC2", fontsize=10)
ax[1].set_zlabel("PC3", fontsize=10)
plt.tight_layout()
plt.show()
Notice that both in 2D and in 3D the data is not linearly separable. However, in 3D there seems to be some sort of separation, at least for classes 0 and 1. So we might intuitively expect class 0 to be classified best, followed by class 1, while classes 2 and 3 will be classified worst. Moreover, due to their proximity, we expect those two classes to be misclassified mainly between each other.
Therefore, overall we don't expect great performance from our classifiers, especially from logistic regression: it has a linear decision boundary, and therefore cannot classify non-linearly separable data well. A random forest, by contrast, is capable of learning essentially any decision boundary if we give it no constraints, so we expect it to classify the training set almost perfectly when first trained. However, we don't expect it to generalize that well to new data, since the classes don't seem well separated. For data of this type, a distance-based classifier might work better.
#cumulative variance explained
cumsum = np.cumsum(principal.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print("Components needed to explain 95% of the variance: ", d)
plt.figure(figsize=(6, 4))
plt.plot(cumsum, linewidth=3)
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.grid(True)
plt.show()
#explained variance for each component
plt.plot(principal.explained_variance_ratio_)
plt.ylabel('Explained Variance')
plt.xlabel('Components')
plt.grid(True)
plt.show()
Notice that all of the features seem to convey information. In fact, the cumulative explained variance shows that many components are needed to reach a high explained variance.
#import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report, matthews_corrcoef, roc_auc_score, ConfusionMatrixDisplay
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
Now, we want to actually start analysing the two classifiers. Our approach is to first apply each classifier and perform hyperparameter tuning. Then, for each classifier, we check whether modifying the features (feature selection) leads to changes in performance.
To start, we split the data into a training and a test set, in order to estimate the generalization error. This is the hold-out method, the simplest validation scheme. We will train the models on the training set, tune their hyperparameters there, and then evaluate their performance on the test set. While this is the simplest approach, it has a possible problem: we might end up choosing a model that fits well only the particular test set rather than general data, which happens when the test set doesn't represent the population. In our case, however, since the test set has 260 entries and the split is stratified, we can expect it to represent the whole population well.
#splitting the dataset
X_train, X_test, y_train, y_test= train_test_split(df, df_label, stratify = df_label, test_size = 0.20, random_state = 42)
print("Training set size: ", X_train.shape)
print("Test set size: ", X_test.shape)
Training set size:  (1040, 33)
Test set size:  (260, 33)
Notice that here we are doing stratified sampling, i.e. we are trying to get a test sample that is representative of the whole population. This allows us not to introduce bias in the split.
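For completeness, the hold-out estimate could later be complemented by stratified k-fold cross-validation, which averages over several splits. A sketch on synthetic stand-in data (the `make_classification` data below is hypothetical, not our dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class stand-in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, random_state=42)

# Stratified 5-fold CV: each fold preserves the class proportions, and the
# mean of the fold scores is a lower-variance estimate than a single split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean().round(3), "+/-", scores.std().round(3))
```

With the real data one would pass `df` and `df_label` in place of the synthetic `X`, `y`.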
Let's start with the Random Forest algorithm. Recall that Random Forest is a supervised machine learning model for classification. It is a form of ensemble learning that trains many decision trees and combines their predictions by majority vote.
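Strictly speaking, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, but for fully grown trees this usually coincides with a majority vote. A sketch of the hard-vote view, on toy stand-in data (hypothetical, not our dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary data standing in for the real training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect each fitted tree's hard prediction for the first 5 samples,
# then take the most frequent class per sample.
tree_votes = np.array([tree.predict(X[:5]) for tree in forest.estimators_]).astype(int)
majority = np.array([np.bincount(votes).argmax() for votes in tree_votes.T])
print(majority)
```

The individual fitted trees are exposed via `forest.estimators_`, which is what makes this decomposition possible.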
First of all, we apply the standard Random Forest algorithm, to check how it performs.
#training the classifier
rf_classifier = RandomForestClassifier(n_estimators = 100, random_state = 42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
Here we have used n_estimators=100 decision trees to perform the majority vote. After training the model on our training set and predicting the test labels, we want to check the performance of this model.
In order to do so, we are going to use various metrics that provide a detailed evaluation of model performance at the class level. We will first use the confusion matrix, a nice way to visualize how our model classifies the different classes. In scikit-learn's convention, each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.
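To make the orientation concrete (true classes on rows, predicted classes on columns), here is a tiny hand-made example:

```python
from sklearn.metrics import confusion_matrix

# Tiny hypothetical labels: row index = true class, column index = predicted.
y_true = [0, 0, 1, 1, 2]
y_hat = [0, 1, 1, 1, 2]

cm = confusion_matrix(y_true, y_hat)
# One true-0 sample was predicted as class 1, hence cm[0][1] == 1.
print(cm.tolist())  # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]
```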
#creating the confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix = cm).plot();
It seems that the model behaves best on classes 0 and 1, while classes 2 and 3 are classified poorly and are mostly confused with each other. This is exactly what we expected after visualizing the data through PCA! Moreover, the classifier separates class 1 quite well from classes 2 and 3. Now we need some statistics to read the matrix more precisely.
The evaluation metrics that we can use are accuracy, precision, recall, and the F1 score.

Since we are dealing with multiclass classification, some of these measures, such as precision and recall, cannot be computed directly. What is done instead is to compute the statistic for each class and then aggregate the per-class values into a single number, using macro-averaging (unweighted mean over classes), micro-averaging (pooling all individual decisions), or weighted averaging (mean over classes weighted by class support).

We go for micro-averaging since we have a balanced dataset, and since in this way we obtain an understandable metric for the overall performance regardless of the class. Note, however, that in single-label multiclass classification micro-averaged precision, recall, and F1 are all equal (and equal to accuracy). We are still going to report them.
We could also use other measures, such as the ROC-AUC score and the Matthews correlation coefficient.
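As a quick check of the fact that micro-averaged precision, recall, and F1 give equal results in single-label multiclass problems, here is a toy example (hypothetical labels):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made multiclass predictions.
y_true = [0, 1, 2, 2, 1, 0, 3, 3]
y_hat = [0, 2, 2, 2, 1, 1, 3, 0]

acc = accuracy_score(y_true, y_hat)
p = precision_score(y_true, y_hat, average='micro')
r = recall_score(y_true, y_hat, average='micro')
f = f1_score(y_true, y_hat, average='micro')
print(acc == p == r == f)  # True: micro-averaging pools every decision
```

Each prediction counts as exactly one true positive (if correct) or one false positive plus one false negative (if wrong), so the pooled precision and recall both reduce to the fraction of correct predictions.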
#giving some metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average = 'micro')
recall = recall_score(y_test, y_pred, average = 'micro')
f1 = metrics.f1_score(y_test, y_pred, average = 'micro')
print("Accuracy:", accuracy.round(5))
print("Precision:", precision.round(5))
print("Recall:", recall.round(5))
print("F1 score: ", f1.round(5))
Accuracy: 0.77308
Precision: 0.77308
Recall: 0.77308
F1 score:  0.77308
#giving the ROC-AUC score, using macro in this case
def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):
    unique_class = set(actual_class)
    roc_auc_dict = {}
    for per_class in unique_class:
        #binarize the problem: current class vs all the others
        other_class = [x for x in unique_class if x != per_class]
        new_actual_class = [0 if x in other_class else 1 for x in actual_class]
        new_pred_class = [0 if x in other_class else 1 for x in pred_class]
        roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
        roc_auc_dict[per_class] = roc_auc
    return roc_auc_dict
lr_roc_auc_multiclass = roc_auc_score_multiclass(y_test, y_pred)
print(lr_roc_auc_multiclass)
temporary = list(lr_roc_auc_multiclass.values())
print("ROC-AUC Score: ", sum(temporary) / len(temporary))
{0: 0.92672647127059, 1: 0.8955685891169763, 2: 0.7714527435339618, 3: 0.8017769607843138}
ROC-AUC Score: 0.8488811911764604
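An alternative to the hand-rolled one-vs-rest function above is scikit-learn's built-in multiclass support in `roc_auc_score`, which works on predicted probabilities rather than hard labels and is therefore finer-grained. A sketch on synthetic stand-in data (hypothetical, not our dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class stand-in data.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC computed on class probabilities, macro-averaged over classes.
auc = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
print(round(auc, 3))
```

With the real data, this would be `roc_auc_score(y_test, rf_classifier.predict_proba(X_test), multi_class='ovr')`.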
#giving the matthews corr coefficient
mc = matthews_corrcoef(y_test,y_pred)
print('Matthews correlation coefficient:', mc.round(5))
Matthews correlation coefficient: 0.69964
Let's combine each of these metrics into a single function:
#create a performance metric function
def performance_metric(y_t, y_p):
    print("-- PERFORMANCE OF THE MODEL -- ")
    cm = confusion_matrix(y_t, y_p)
    disp = ConfusionMatrixDisplay(confusion_matrix = cm)
    accuracy = accuracy_score(y_t, y_p)
    precision = precision_score(y_t, y_p, average = 'micro')
    recall = recall_score(y_t, y_p, average = 'micro')
    f1 = f1_score(y_t, y_p, average = 'micro')
    print("General statistics:")
    print(" - Accuracy:", accuracy.round(5))
    print(" - Precision:", precision.round(5))
    print(" - Recall:", recall.round(5))
    print(" - F1 score: ", f1.round(5))
    #average the per-class AUC values (avoid shadowing sklearn's roc_auc_score)
    roc_auc_multiclass = roc_auc_score_multiclass(y_t, y_p)
    auc_values = list(roc_auc_multiclass.values())
    roc_auc = sum(auc_values) / len(auc_values)
    print(" - AUC Score: ", round(roc_auc, 5))
    mc = matthews_corrcoef(y_t, y_p)
    print(' - Matthews_corrcoef: {0:.5f}'.format(mc))
    print(classification_report(y_t, y_p))
    print("\nConfusion matrix:")
    disp.plot()
    plt.grid(False)
    plt.show()
    #avoid shadowing the sklearn.metrics module imported above
    results = pd.Series({'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1, 'roc_auc': roc_auc, 'matthews': mc, 'confusion': disp})
    return results
Now we want to fine-tune the Random Forest, to try to improve the accuracy of the model. In particular, we are going to perform hyperparameter selection, to understand which parameters work best for our data.
Due to computational and time constraints, we will proceed as follows. We first apply Randomized Search Cross-Validation, considering a wide range of values for the parameters. This class tries out a given number of random combinations, selecting a random value for each hyperparameter at every iteration. It is usually faster than Grid Search; however, it doesn't necessarily find the optimum, and taking its best parameters at face value could be misleading, since many possible combinations have not been tried.
Therefore, we will take the parameters that performed best and run a Grid Search around them, in which all combinations of the reduced hyperparameter values are tried using cross-validation. Using cross-validation in this step is important since, otherwise, there would be a risk of overfitting the training set. By performing k-fold cross-validation for hyperparameter selection, the performance measure is the average over the k splits, which also avoids losing too much data in the process.
We chose this approach since Grid Search is computationally intensive: while it is the most exhaustive method for hyperparameter tuning, exploring the full space with it would be too expensive. By doing this, we explore relatively few combinations exhaustively, after having first shrunk the parameter space.
#definition of parameters to try for random search
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
criterion = ['gini', 'entropy']
max_depth = [int(x) for x in np.linspace(10, 220, num = 8)]
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 8, 12]
max_features = [5, 10, 20, 'sqrt']
min_impurity_decrease = [0, 0.1, 0.2, 0.5, 0.8]
bootstrap = [True, False]
param_grid = {
    'n_estimators': n_estimators,
    'criterion': criterion,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'max_features': max_features,
    'min_impurity_decrease': min_impurity_decrease,
    'bootstrap': bootstrap
}
#apply random search
rf_random = RandomForestClassifier(random_state = 42, warm_start = True, n_jobs = -1)
rf_random_search = RandomizedSearchCV(rf_random, param_grid, n_iter = 2500, cv = 5, verbose = 1, random_state=42, n_jobs = -1)
rf_random_search.fit(X_train, y_train)
Fitting 5 folds for each of 2500 candidates, totalling 12500 fits
RandomizedSearchCV(cv=5,
estimator=RandomForestClassifier(n_jobs=-1, random_state=42,
warm_start=True),
n_iter=2500, n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'criterion': ['gini', 'entropy'],
'max_depth': [10, 40, 70, 100, 130, 160,
190, 220],
'max_features': [5, 10, 20, 'sqrt'],
'min_impurity_decrease': [0, 0.1, 0.2,
0.5, 0.8],
'min_samples_leaf': [1, 2, 4, 8, 12],
'min_samples_split': [2, 5, 10, 15, 20],
'n_estimators': [100, 200, 300, 400,
500, 600, 700, 800,
900, 1000]},
random_state=42, verbose=1)
Now, we can actually see which are the best parameters and check their performance on the test set.
#best parameters for random search
print("Best parameters: ", rf_random_search.best_params_)
rf_random_model = rf_random_search.best_estimator_
Best parameters: {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0, 'max_features': 'sqrt', 'max_depth': 220, 'criterion': 'entropy', 'bootstrap': False}
#check the performance of best model on test set
rf_random_model.fit(X_train, y_train)
y_pred = rf_random_model.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.81154
- Precision: 0.81154
- Recall: 0.81154
- F1 score: 0.81154
- AUC Score: 0.87418
- Matthews_corrcoef: 0.749
precision recall f1-score support
0 0.85 0.91 0.88 67
1 0.79 0.85 0.82 62
2 0.76 0.71 0.74 63
3 0.84 0.76 0.80 68
accuracy 0.81 260
macro avg 0.81 0.81 0.81 260
weighted avg 0.81 0.81 0.81 260
Confusion matrix:
accuracy     0.811538
precision    0.811538
recall       0.811538
f1           0.811538
roc_auc      0.874178
matthews     0.749418
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object
Notice that even this relatively simple search has improved the model: accuracy rose from $0.773$ to $0.811$, the AUC score from $0.848$ to $0.874$, and the Matthews correlation coefficient from $0.699$ to $0.749$. Overall, all metrics improved.
Now, let's look more closely at how the best models found by the random search behave and what their accuracy is.
#giving the accuracy of the first 10 models
rs_results = pd.DataFrame(rf_random_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= rs_results.iloc[:10, :]
models = {}
for i in top_10_parameters.index:
models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values
results = []
names = []
for model, score in models.items():
results.append(score)
names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Accuracy of top 10 models")
plt.ylabel("Accuracy")
plt.show()
Notice that the best model we found has both a high mean and a high median accuracy across folds.
top_10_parameters
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_estimators | param_min_samples_split | param_min_samples_leaf | param_min_impurity_decrease | param_max_features | param_max_depth | ... | param_bootstrap | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.432691 | 0.676007 | 0.318548 | 0.284475 | 300 | 2 | 1 | 0 | sqrt | 220 | ... | False | {'n_estimators': 300, 'min_samples_split': 2, ... | 0.812500 | 0.725962 | 0.730769 | 0.826923 | 0.783654 | 0.775962 | 0.041313 | 1 |
| 1 | 13.739838 | 0.251095 | 0.810432 | 0.216933 | 900 | 5 | 2 | 0 | sqrt | 190 | ... | False | {'n_estimators': 900, 'min_samples_split': 5, ... | 0.807692 | 0.735577 | 0.730769 | 0.822115 | 0.778846 | 0.775000 | 0.036916 | 2 |
| 2 | 0.697334 | 0.270379 | 0.149600 | 0.099821 | 100 | 2 | 1 | 0 | 5 | 190 | ... | False | {'n_estimators': 100, 'min_samples_split': 2, ... | 0.812500 | 0.745192 | 0.745192 | 0.807692 | 0.764423 | 0.775000 | 0.029543 | 2 |
| 3 | 9.198190 | 0.221321 | 1.450120 | 0.195836 | 900 | 5 | 2 | 0 | sqrt | 40 | ... | False | {'n_estimators': 900, 'min_samples_split': 5, ... | 0.807692 | 0.735577 | 0.730769 | 0.822115 | 0.778846 | 0.775000 | 0.036916 | 2 |
| 4 | 1.671727 | 0.306434 | 0.339691 | 0.221298 | 200 | 2 | 2 | 0 | sqrt | 190 | ... | False | {'n_estimators': 200, 'min_samples_split': 2, ... | 0.812500 | 0.740385 | 0.725962 | 0.822115 | 0.769231 | 0.774038 | 0.038099 | 5 |
| 5 | 2.151842 | 0.427804 | 0.186900 | 0.081329 | 300 | 2 | 1 | 0 | 5 | 70 | ... | False | {'n_estimators': 300, 'min_samples_split': 2, ... | 0.798077 | 0.745192 | 0.745192 | 0.807692 | 0.774038 | 0.774038 | 0.025979 | 5 |
| 6 | 2.228039 | 0.353992 | 0.444810 | 0.239826 | 300 | 2 | 1 | 0 | sqrt | 220 | ... | False | {'n_estimators': 300, 'min_samples_split': 2, ... | 0.798077 | 0.745192 | 0.745192 | 0.807692 | 0.774038 | 0.774038 | 0.025979 | 5 |
| 7 | 7.619214 | 1.714232 | 1.678509 | 1.488194 | 600 | 2 | 2 | 0 | sqrt | 40 | ... | False | {'n_estimators': 600, 'min_samples_split': 2, ... | 0.812500 | 0.740385 | 0.735577 | 0.812500 | 0.769231 | 0.774038 | 0.033447 | 5 |
| 8 | 3.236939 | 0.560252 | 0.803450 | 0.320552 | 500 | 5 | 2 | 0 | sqrt | 130 | ... | False | {'n_estimators': 500, 'min_samples_split': 5, ... | 0.798077 | 0.725962 | 0.730769 | 0.822115 | 0.788462 | 0.773077 | 0.038148 | 9 |
| 9 | 5.749816 | 0.753724 | 1.266811 | 0.701573 | 900 | 5 | 2 | 0 | sqrt | 10 | ... | False | {'n_estimators': 900, 'min_samples_split': 5, ... | 0.802885 | 0.740385 | 0.730769 | 0.822115 | 0.769231 | 0.773077 | 0.035119 | 9 |
10 rows × 21 columns
Notice that the other top models behave similarly and share similar parameters. For instance, all of the top-performing models have bootstrap = False. This gives us some insight into the hyperparameter space and where it is better to search next.
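One quick way to quantify this kind of insight is to count how often each value appears among the top-ranked candidates in `cv_results_`. A minimal sketch on a toy results table (the column names mirror sklearn's `param_*` convention, but the values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for pd.DataFrame(search.cv_results_).
toy_results = pd.DataFrame({
    "param_bootstrap":    [False, False, False, True, False],
    "param_max_features": ["sqrt", "sqrt", 5, "sqrt", "sqrt"],
    "rank_test_score":    [1, 2, 2, 4, 5],
})

# Value counts within the top-k candidates hint at promising regions
# of the hyperparameter space for the follow-up grid search.
top_k = toy_results.nsmallest(4, "rank_test_score")
print(top_k["param_bootstrap"].value_counts())
```

Applied to our real `rs_results`, a dominant value (like `bootstrap = False` here) is a natural candidate to fix in the grid-search stage.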
We can also plot the average performance for each value of each hyperparameter in our search.
#drop timing and per-split columns, keeping one mean test score per candidate
rs_results = rs_results.drop([
    'mean_fit_time',
    'std_fit_time',
    'mean_score_time',
    'std_score_time',
    'params',
    'split0_test_score',
    'split1_test_score',
    'split2_test_score',
    'split3_test_score',
    'split4_test_score',
    'std_test_score'],
    axis=1)
#plotting average performance for each parameter value
fig, axs = plt.subplots(ncols=4, nrows=2, figsize = (20,15))
sns.set(style="whitegrid", color_codes=True, font_scale = 1)
sns.barplot(x='param_n_estimators', y='mean_test_score', data=rs_results, ax=axs[0,0], color='lightgrey')
axs[0,0].set_title(label = 'n_estimators', size=20, weight='bold')
sns.barplot(x='param_min_samples_split', y='mean_test_score', data=rs_results, ax=axs[0,1], color='coral')
axs[0,1].set_title(label = 'min_samples_split', size=20, weight='bold')
sns.barplot(x='param_min_samples_leaf', y='mean_test_score', data=rs_results, ax=axs[0,2], color='lightgreen')
axs[0,2].set_title(label = 'min_samples_leaf', size=20, weight='bold')
sns.barplot(x='param_min_impurity_decrease', y='mean_test_score', data=rs_results, ax=axs[0,3], color='lightgrey')
axs[0,3].set_title(label = 'min_impurity_decrease', size=20, weight='bold')
sns.barplot(x='param_max_features', y='mean_test_score', data=rs_results, ax=axs[1,0], color='wheat')
axs[1,0].set_title(label = 'max_features', size=20, weight='bold')
sns.barplot(x='param_max_depth', y='mean_test_score', data=rs_results, ax=axs[1,1], color='lightpink')
axs[1,1].set_title(label = 'max_depth', size=20, weight='bold')
sns.barplot(x='param_bootstrap',y='mean_test_score', data=rs_results, ax=axs[1,2], color='skyblue')
axs[1,2].set_title(label = 'bootstrap', size=20, weight='bold')
sns.barplot(x='param_criterion', y='mean_test_score', data=rs_results, ax=axs[1,3], color='wheat')
axs[1,3].set_title(label = 'criterion', size=20, weight='bold')
plt.show()
Looking at the plots above, we can see how well each value of each hyperparameter performed on average.
Based on these insights, we can move to a finer round of hyperparameter tuning using grid search. We will also include the parameters of the best model found so far in the grid.
#definition of the parameters to try for Grid Search
from sklearn.model_selection import GridSearchCV
n_estimators = [300, 400, 450]
criterion = ['entropy']
max_depth = [40, 100, 180, 190, 220]
min_samples_split = [2, 8, 10, 12]
min_samples_leaf = [1, 2, 12, 15]
max_features = ['sqrt', 15, 20]
min_impurity_decrease = [0]
bootstrap = [False]
param_grid = {
'n_estimators': n_estimators,
'criterion': criterion,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'max_features': max_features,
'min_impurity_decrease': min_impurity_decrease,
'bootstrap': bootstrap
}
#performing grid search CV
rf_grid = RandomForestClassifier(random_state = 42, n_jobs = -1, warm_start = True)
rf_grid_search = GridSearchCV(rf_grid, param_grid, cv=5, scoring = 'accuracy', verbose = 1, n_jobs = -1, return_train_score=True)
rf_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
GridSearchCV(cv=5,
estimator=RandomForestClassifier(n_jobs=-1, random_state=42,
warm_start=True),
n_jobs=-1,
param_grid={'bootstrap': [False], 'criterion': ['entropy'],
'max_depth': [40, 100, 180, 190, 220],
'max_features': ['sqrt', 15, 20],
'min_impurity_decrease': [0],
'min_samples_leaf': [1, 2, 12, 15],
'min_samples_split': [2, 8, 10, 12],
'n_estimators': [300, 400, 450]},
return_train_score=True, scoring='accuracy', verbose=1)
Once we have trained it, we can actually obtain the best parameters and check their performance:
#obtaining the best parameters
print("Best parameters: ", rf_grid_search.best_params_)
rf_gs_best_model = rf_grid_search.best_estimator_
Best parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 40, 'max_features': 'sqrt', 'min_impurity_decrease': 0, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
#performance of best model in test set
rf_gs_best_model.fit(X_train, y_train)
y_pred = rf_gs_best_model.predict(X_test)
rf_gs_best_model_metrics = performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.82692
- Precision: 0.82692
- Recall: 0.82692
- F1 score: 0.82692
- AUC Score: 0.88454
- Matthews_corrcoef: 0.770
precision recall f1-score support
0 0.85 0.91 0.88 67
1 0.81 0.87 0.84 62
2 0.81 0.75 0.78 63
3 0.84 0.78 0.81 68
accuracy 0.83 260
macro avg 0.83 0.83 0.83 260
weighted avg 0.83 0.83 0.83 260
Confusion matrix:
Notice that it performs slightly better than the model obtained by the random search alone: accuracy increased by $0.015$ (from $0.811$ to $0.827$), and the other metrics improved as well. Compared with the previous classifiers, what seems to have improved most is the ability to correctly classify classes 2 and 3, whereas precision, recall and f1-score for class 0 are unchanged.
Overall, this model performs well.
Before moving on, let's visualize how the accuracy changes in the best models.
#giving the performance of the best 10 models found
rf_gs_results = pd.DataFrame(rf_grid_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= rf_gs_results.iloc[:10, :]
models = {}
for i in top_10_parameters.index:
models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values
results = []
names = []
for model, score in models.items():
results.append(score)
names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Cross-validated accuracy of best models")
plt.show()
top_10_parameters
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bootstrap | param_criterion | param_max_depth | param_max_features | param_min_impurity_decrease | param_min_samples_leaf | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.492730 | 0.370648 | 1.919265 | 0.143564 | False | entropy | 180 | sqrt | 0 | 2 | ... | 0.778846 | 0.033722 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 1 | 2.452039 | 1.305935 | 1.910090 | 0.785296 | False | entropy | 220 | sqrt | 0 | 2 | ... | 0.778846 | 0.033722 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 2 | 3.820179 | 1.272117 | 1.464482 | 1.116295 | False | entropy | 190 | sqrt | 0 | 2 | ... | 0.778846 | 0.033722 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 3.531351 | 1.357857 | 1.573789 | 1.329590 | False | entropy | 100 | sqrt | 0 | 2 | ... | 0.778846 | 0.033722 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 2.374845 | 0.972114 | 2.310219 | 0.874939 | False | entropy | 40 | sqrt | 0 | 2 | ... | 0.778846 | 0.033722 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 5 | 5.184129 | 0.466838 | 0.473733 | 0.211612 | False | entropy | 180 | sqrt | 0 | 1 | ... | 0.776923 | 0.033530 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 6 | 4.220508 | 0.932175 | 0.957039 | 0.629906 | False | entropy | 40 | sqrt | 0 | 1 | ... | 0.776923 | 0.033530 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 7 | 5.226416 | 0.349563 | 0.397337 | 0.151333 | False | entropy | 220 | sqrt | 0 | 1 | ... | 0.776923 | 0.033530 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 8 | 5.572690 | 0.456409 | 0.554517 | 0.443839 | False | entropy | 100 | sqrt | 0 | 1 | ... | 0.776923 | 0.033530 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 9 | 5.836584 | 0.316536 | 0.344877 | 0.320955 | False | entropy | 190 | sqrt | 0 | 1 | ... | 0.776923 | 0.033530 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
10 rows × 28 columns
Notice that the best performing models have similar characteristics.
We can also visualize the grid search process more closely, by looking at how the score changes when we vary one hyperparameter at a time while keeping all the others fixed at the values of the best model found.
#plotting the score as the parameter changes
def plot_search_results(grid):
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    #for each hyperparameter, mask the candidates where it takes its best value
    masks = []
    masks_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_' + p_k].data == p_v))
    params = grid.param_grid
    fig, ax = plt.subplots(2, int(len(params)/2), figsize=(28, 10))
    fig.suptitle('Score per parameter')
    fig.text(0.08, 0.5, 'CV Average Score', va='center', rotation='vertical')
    j = 0
    for i, p in enumerate(masks_names):
        #fix every parameter except p at its best value, and let p vary
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        if i >= int(len(params)/2):
            i = i - int(len(params)/2)
            j = 1
        ax[j, i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[j, i].set_title(p)
    plt.rc('xtick', labelsize=5)
    plt.rc('ytick', labelsize=5)
    plt.legend(loc="best", fontsize=15)
    plt.show()
plot_search_results(rf_grid_search)
These plots give us a good idea of how each hyperparameter affects the performance.
Now, let's save this best model that we have obtained.
#best classifier of random forest
rf_best_classifier = rf_gs_best_model
rf_params = rf_best_classifier.get_params()
for param, value in rf_params.items():
print(param, ":", value)
bootstrap : False
ccp_alpha : 0.0
class_weight : None
criterion : entropy
max_depth : 40
max_features : sqrt
max_leaf_nodes : None
max_samples : None
min_impurity_decrease : 0
min_samples_leaf : 2
min_samples_split : 2
min_weight_fraction_leaf : 0.0
n_estimators : 300
n_jobs : -1
oob_score : False
random_state : 42
verbose : 0
warm_start : True
We notice an improvement with respect to our initial classifier, which is what we expected from cross-validated hyperparameter tuning. Now let's save the results in a dataframe.
#saving results of best model in dataframe
best_classifiers = pd.DataFrame(columns = ['Random Forest', 'Logistic Regression'], index = ['parameters', 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'matthews', 'confusion'])
rf_gs_best_model_metrics['parameters'] = rf_params
best_classifiers['Random Forest'] = rf_gs_best_model_metrics
Now we can compute which features were the most relevant for the classification.
# Create a series containing feature importances from the model and feature names from the training data
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(13, 6))
feature_importances.plot.bar(ax=ax);
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Most Important Features')
plt.rc('xtick', labelsize=20)
plt.rc('ytick', labelsize=20)
plt.show()
We can also visualize the resulting forest. For instance, let's visualize the first tree in the random forest.
#visualize one of the tree in random forest
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(15, 10))
plot_tree(rf_best_classifier.estimators_[0],
          feature_names = X_train.columns.astype(str),
          class_names = df_label.unique().astype(str),
          filled=True, rounded=True)
plt.show()
Now we want to apply logistic regression. Recall that logistic regression is a supervised learning algorithm for binary classification tasks: it is a generalized linear model that uses the logistic function to estimate the probability that an instance belongs to a given class.
Unlike Random Forests, this algorithm does not directly handle multiclass classification. To apply it here, we can do one of the following:
- One-versus-Rest (OvR): fit one binary classifier per class, separating that class from all the others;
- multinomial logistic regression: generalize the model to several classes directly, via the softmax function;
- One-versus-One (OvO): fit one binary classifier for each pair of classes.
Now we want to check which of these approaches is best.
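The probability estimates mentioned above come from the logistic (sigmoid) function in the binary case and from its multiclass generalisation, the softmax. A minimal numerical sketch, with made-up scores:

```python
import numpy as np

def sigmoid(z):
    # Binary case: P(y=1 | x) for a linear score z = w.x + b.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multiclass case: one score per class, normalised to sum to 1.
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))  # a score of 0 maps to probability 0.5
probs = softmax(np.array([2.0, 1.0, 0.1, -1.0]))  # illustrative scores
print(probs)
```

Multinomial logistic regression fits one score per class and applies the softmax; OvR instead fits one sigmoid per class and normalises afterwards.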
Let's start by using the OvR (that is One-versus-Rest), and let's see how it performs on the test set.
#apply OvR classification
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression(multi_class='ovr', solver='liblinear')
lr_classifier.fit(X_train, y_train)
LogisticRegression(multi_class='ovr', solver='liblinear')
#OvR classification performance
y_pred = lr_classifier.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.76154
- Precision: 0.76154
- Recall: 0.76154
- F1 score: 0.76154
- AUC Score: 0.84025
- Matthews_corrcoef: 0.682
precision recall f1-score support
0 0.85 0.90 0.87 67
1 0.77 0.77 0.77 62
2 0.68 0.63 0.66 63
3 0.74 0.74 0.74 68
accuracy 0.76 260
macro avg 0.76 0.76 0.76 260
weighted avg 0.76 0.76 0.76 260
Confusion matrix:
accuracy     0.761538
precision    0.761538
recall       0.761538
f1           0.761538
roc_auc      0.840254
matthews     0.682003
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object
Notice that this initial accuracy is lower than the initial accuracy we obtained with the random forest, which suggests that logistic regression performs worse than random forest on this dataset. As before, the model classifies classes 0 and 1 fairly well, while it performs worse on classes 2 and 3, which are mainly misclassified as each other.
Now, we can also see how the classification performs using the multinomial approach.
#apply multinomial classification and check the performance
lr_classifier = LogisticRegression(multi_class='multinomial',solver='newton-cg')
lr_classifier = lr_classifier.fit(X_train, y_train)
y_pred = lr_classifier.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.77308
- Precision: 0.77308
- Recall: 0.77308
- F1 score: 0.77308
- AUC Score: 0.84789
- Matthews_corrcoef: 0.697
precision recall f1-score support
0 0.84 0.87 0.85 67
1 0.80 0.77 0.79 62
2 0.71 0.67 0.69 63
3 0.74 0.78 0.76 68
accuracy 0.77 260
macro avg 0.77 0.77 0.77 260
weighted avg 0.77 0.77 0.77 260
Confusion matrix:
accuracy     0.773077
precision    0.773077
recall       0.773077
f1           0.773077
roc_auc      0.847886
matthews     0.697377
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object
Here we get a slightly better result than with OvR, closer to the random forest before hyperparameter tuning. We will compare them in more depth later.
For the multinomial approach, we can also obtain the probability estimates for each class for new data points; the output is one vector of class probabilities per sample.
#probabilites of belonging to a class for each sample
probabilities = lr_classifier.predict_proba(X_test)
print("Probabilities:", probabilities)
Probabilities: [[2.69426554e-05 1.68496453e-03 2.69354327e-01 7.28933766e-01]
 [6.71111374e-03 9.43731948e-01 2.37474413e-02 2.58094965e-02]
 [2.45995421e-04 2.42185519e-02 9.35936052e-01 3.95994007e-02]
 ...
 [4.74515363e-03 9.07826831e-01 1.43233600e-02 7.31046555e-02]
 [8.11863476e-03 2.65820090e-01 6.74353698e-01 5.17075771e-02]
 [6.76379977e-01 2.26857677e-01 3.59868197e-02 6.07755264e-02]]
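As a sanity check on how these probabilities relate to the hard predictions: each row of `predict_proba` sums to one, and the predicted label is the class in the arg-max column. A self-contained sketch on synthetic data (not our project dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class problem standing in for our real data.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=4, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)
# Each row is a probability distribution over the four classes,
# and predict() returns the class with the highest probability.
print(np.allclose(probs.sum(axis=1), 1.0))
print((clf.classes_[probs.argmax(axis=1)] == clf.predict(X)).all())
```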
We can also see how the classifier obtained using OvO performs.
#apply OvO classification and check the performance
from sklearn.multiclass import OneVsOneClassifier
OvO_clf = OneVsOneClassifier(LogisticRegression())
OvO_clf.fit(X_train, y_train)
y_pred = OvO_clf.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.75769
- Precision: 0.75769
- Recall: 0.75769
- F1 score: 0.75769
- AUC Score: 0.83736
- Matthews_corrcoef: 0.677
precision recall f1-score support
0 0.86 0.88 0.87 67
1 0.77 0.74 0.75 62
2 0.69 0.63 0.66 63
3 0.71 0.76 0.74 68
accuracy 0.76 260
macro avg 0.76 0.76 0.76 260
weighted avg 0.76 0.76 0.76 260
Confusion matrix:
accuracy     0.757692
precision    0.757692
recall       0.757692
f1           0.757692
roc_auc      0.837362
matthews     0.676922
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object
Here we get the worst performance seen so far.
In general, logistic regression seems to perform worse than the random forest. This may be because the data does not appear to be linearly separable. However, before drawing that conclusion, let's try fine-tuning this model as well.
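The intuition that a linear decision boundary is limiting can be illustrated on a deliberately non-linear toy problem (make_moons, unrelated to our dataset):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: not linearly separable by construction.
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)
print(f"logistic regression: {lr_acc:.3f}  random forest: {rf_acc:.3f}")
```

On data like this, the tree ensemble can carve out the curved boundary that a single linear model cannot; whether the same gap persists on our dataset after tuning is what we check next.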
In this case, we decided to use grid search directly: logistic regression trains much faster than a random forest, so we do not need a preliminary randomized search and can go straight to the more exhaustive approach.
#definition of parameters to try for grid search
penalty = ['l2', 'l1', 'elasticnet', None]  #None (not the string 'None') disables regularization
tol = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
C = [0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10]
solver = ['lbfgs', 'liblinear', 'newton-cholesky', 'saga']
max_iter = [int(x) for x in np.linspace(10, 1000, num=10)]
multi_class = ['ovr', 'multinomial']
l1_ratio = [x for x in np.linspace(0, 1, num=3)]
param_grid = {
'penalty': penalty,
'tol': tol,
'C': C,
'solver': solver,
'max_iter': max_iter,
'multi_class': multi_class,
'l1_ratio': l1_ratio
}
#apply grid search CV
lr_grid = LogisticRegression(random_state = 42)
lr_grid_search = GridSearchCV(lr_grid, param_grid, cv = 5, scoring = 'accuracy', return_train_score=True, verbose = 1)
lr_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 38400 candidates, totalling 192000 fits
GridSearchCV(cv=5, estimator=LogisticRegression(),
param_grid={'C': [0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10],
'l1_ratio': [0.0, 0.5, 1.0],
'max_iter': [10, 120, 230, 340, 450, 560, 670, 780,
890, 1000],
'multi_class': ['ovr', 'multinomial'],
'penalty': ['l2', 'l1', 'elasticnet', 'None'],
'solver': ['lbfgs', 'liblinear', 'newton-cholesky',
'saga'],
'tol': [0.1, 0.01, 0.001, 0.0001, 1e-05]},
return_train_score=True, scoring='accuracy', verbose=1)
Once we have trained it, we can actually obtain the best parameters and check their performance:
#obtain the best parameters
print("Best parameters: ", lr_grid_search.best_params_)
lr_gs_best_model = lr_grid_search.best_estimator_
Best parameters: {'C': 0.1, 'l1_ratio': 0.0, 'max_iter': 120, 'multi_class': 'multinomial', 'penalty': 'l1', 'solver': 'saga', 'tol': 0.01}
#check the performance of the model
lr_gs_best_model.fit(X_train, y_train)
y_pred = lr_gs_best_model.predict(X_test)
lr_gs_best_model_metrics = performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.78077
- Precision: 0.78077
- Recall: 0.78077
- F1 score: 0.78077
- AUC Score: 0.85287
- Matthews_corrcoef: 0.708
precision recall f1-score support
0 0.86 0.93 0.89 67
1 0.82 0.79 0.80 62
2 0.68 0.63 0.66 63
3 0.75 0.76 0.76 68
accuracy 0.78 260
macro avg 0.78 0.78 0.78 260
weighted avg 0.78 0.78 0.78 260
Confusion matrix:
We see an improvement of only about $0.008$ in accuracy over the initial multinomial model, so it seems this type of classifier cannot learn much more from the data. The resulting model is worse than the random forest on all metrics; in accuracy, for instance, $0.781$ against $0.827$. Moreover, classes 2 and 3 are still the ones most often confused with each other.
Now let's visualize how the accuracy changes in the best models.
#visualization of average accuracy score of best models
lr_gs_results = pd.DataFrame(lr_grid_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= lr_gs_results.iloc[:10, :]
models = {}
for i in top_10_parameters.index:
models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values
results = []
names = []
for model, score in models.items():
results.append(score)
names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Average accuracy score of the best models")
plt.show()
Many models seem to behave similarly on our dataset.
top_10_parameters
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_l1_ratio | param_max_iter | param_multi_class | param_penalty | param_solver | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.032924 | 0.003818 | 0.001782 | 0.000409 | 0.1 | 0.5 | 670 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.771635 | 0.778846 | 0.776442 | 0.771635 | 0.772837 | 0.774279 | 0.002885 |
| 1 | 0.040890 | 0.007960 | 0.002195 | 0.000399 | 0.1 | 0.0 | 120 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.771635 | 0.778846 | 0.777644 | 0.771635 | 0.772837 | 0.774519 | 0.003097 |
| 2 | 0.042094 | 0.007462 | 0.001790 | 0.000752 | 0.1 | 1.0 | 780 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.771635 | 0.778846 | 0.777644 | 0.771635 | 0.772837 | 0.774519 | 0.003097 |
| 3 | 0.035490 | 0.004170 | 0.002400 | 0.000497 | 0.1 | 0.0 | 340 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.778846 | 0.776442 | 0.772837 | 0.772837 | 0.774760 | 0.002475 |
| 4 | 0.036688 | 0.000976 | 0.002001 | 0.000630 | 0.1 | 0.0 | 780 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.780048 | 0.776442 | 0.771635 | 0.772837 | 0.774760 | 0.003097 |
| 5 | 0.035511 | 0.007932 | 0.001597 | 0.000790 | 0.1 | 1.0 | 1000 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.778846 | 0.777644 | 0.771635 | 0.772837 | 0.774760 | 0.002905 |
| 6 | 0.039301 | 0.008711 | 0.001988 | 0.000631 | 0.1 | 1.0 | 450 | multinomial | elasticnet | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.778846 | 0.776442 | 0.771635 | 0.772837 | 0.774519 | 0.002698 |
| 7 | 0.038313 | 0.003195 | 0.001787 | 0.000756 | 0.1 | 0.0 | 450 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.771635 | 0.778846 | 0.777644 | 0.771635 | 0.772837 | 0.774519 | 0.003097 |
| 8 | 0.038688 | 0.007938 | 0.001796 | 0.000990 | 0.1 | 0.0 | 560 | multinomial | l1 | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.778846 | 0.776442 | 0.771635 | 0.772837 | 0.774519 | 0.002698 |
| 9 | 0.034514 | 0.009134 | 0.001796 | 0.000746 | 0.1 | 1.0 | 230 | multinomial | elasticnet | saga | ... | 0.760577 | 0.028099 | 1 | 0.772837 | 0.778846 | 0.776442 | 0.771635 | 0.772837 | 0.774519 | 0.002698 |
10 rows × 27 columns
Notice that the parameters of the best performing models are similar.
We can also examine how each parameter behaves when all the others are kept fixed at the values of the best model found.
#plotting the score as the parameter changes
def plot_search_results(grid):
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    #for each hyperparameter, mask the candidates where it takes its best value
    masks = []
    masks_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_' + p_k].data == p_v))
    params = grid.param_grid
    fig, ax = plt.subplots(2, int(len(params)/2 + 1), figsize=(24, 8))
    fig.suptitle('Score per parameter')
    fig.text(0.08, 0.5, 'CV Average Score', va='center', rotation='vertical')
    j = 1
    for i, p in enumerate(masks_names):
        #fix every parameter except p at its best value, and let p vary
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        if i >= int(len(params)/2):
            i = i - int(len(params)/2)
            j = 0
        ax[j, i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[j, i].set_title(p)
    plt.rc('ytick', labelsize=10)
    plt.rc('xtick', labelsize=10)
    plt.legend(loc="best", fontsize=15)
    plt.show()
plot_search_results(lr_grid_search)
Let's apply the same procedure to the one-vs-one (OvO) strategy; we will move through it more quickly.
#definition of parameters for grid search and grid search
penalty = ['l2', 'l1', 'elasticnet', 'None']
tol = [1e-2, 1e-3, 1e-4]
C = [0.01, 0.1, 0.5, 1, 2]
solver = ['lbfgs', 'liblinear', 'newton-cholesky', 'saga']
max_iter = [250, 300]
multi_class = ['ovr', 'multinomial']
param_grid = {
'estimator__penalty': penalty,
'estimator__tol': tol,
'estimator__C': C,
'estimator__solver': solver,
'estimator__max_iter': max_iter,
'estimator__multi_class': multi_class
}
ovo_grid = OneVsOneClassifier(LogisticRegression(random_state = 42))
ovo_grid_search = GridSearchCV(ovo_grid, param_grid, cv = 5, scoring = 'accuracy', return_train_score=True, verbose = 1)
ovo_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 960 candidates, totalling 4800 fits
GridSearchCV(cv=5,
estimator=OneVsOneClassifier(estimator=LogisticRegression(random_state=42)),
param_grid={'estimator__C': [0.01, 0.1, 0.5, 1, 2],
'estimator__max_iter': [250, 300],
'estimator__multi_class': ['ovr', 'multinomial'],
'estimator__penalty': ['l2', 'l1', 'elasticnet',
'None'],
'estimator__solver': ['lbfgs', 'liblinear',
'newton-cholesky', 'saga'],
'estimator__tol': [0.01, 0.001, 0.0001]},
return_train_score=True, scoring='accuracy', verbose=1)
#getting the best parameters
print("best parameters: ", ovo_grid_search.best_params_)
ovo_gs_best_model = ovo_grid_search.best_estimator_
best parameters: {'estimator__C': 0.5, 'estimator__max_iter': 250, 'estimator__multi_class': 'multinomial', 'estimator__penalty': 'l1', 'estimator__solver': 'saga', 'estimator__tol': 0.0001}
#checking the performance of the model
ovo_gs_best_model.fit(X_train, y_train)
y_pred = ovo_gs_best_model.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL --
General statistics:
- Accuracy: 0.76538
- Precision: 0.76538
- Recall: 0.76538
- F1 score: 0.76538
- AUC Score: 0.84253
- Matthews_corrcoef: 0.687
precision recall f1-score support
0 0.86 0.90 0.88 67
1 0.78 0.76 0.77 62
2 0.69 0.63 0.66 63
3 0.72 0.76 0.74 68
accuracy 0.77 260
macro avg 0.76 0.76 0.76 260
weighted avg 0.76 0.77 0.76 260
Confusion matrix:
accuracy                                              0.765385
precision                                             0.765385
recall                                                0.765385
f1                                                    0.765385
roc_auc                                               0.842526
matthews                                              0.687176
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object
Here, we see that the OvO approach performs worse than the multinomial approach (accuracy drops from $0.781$ to $0.765$). Therefore, out of all three approaches, the multinomial one performs best, but it still does not perform as well as Random Forest.
Now let's save the best model that we found.
#best classifier of logistic regression
lr_best_classifier = lr_gs_best_model
lr_params = lr_best_classifier.get_params()
for param, value in lr_params.items():
print(param, ":", value)
C : 0.1
class_weight : None
dual : False
fit_intercept : True
intercept_scaling : 1
l1_ratio : 0.0
max_iter : 120
multi_class : multinomial
n_jobs : None
penalty : l1
random_state : None
solver : saga
tol : 0.01
verbose : 0
warm_start : False
#saving results of best model in dataframe
lr_gs_best_model_metrics['parameters'] = lr_params
best_classifiers['Logistic Regression'] = lr_gs_best_model_metrics
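Beyond storing the parameters in a dataframe, the fitted model itself can be persisted to disk. A minimal sketch with joblib (the file name is arbitrary, and a small synthetic model stands in for `lr_best_classifier`):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for lr_best_classifier: any fitted estimator can be persisted.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model and reload it later.
joblib.dump(model, "lr_best_classifier.joblib")
restored = joblib.load("lr_best_classifier.joblib")

# The restored model reproduces the original predictions exactly.
print((restored.predict(X) == model.predict(X)).all())
```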
We could also try to visualize the most important features in this case. To do so, since we have a separate set of coefficients for each class, we take the average of the absolute values of the coefficients across all classes.
# Create a series containing feature importances from the model and feature names from the training data
coefficients = lr_best_classifier.coef_
avg_importance = np.mean(np.abs(coefficients), axis=0)
feature_importance2 = pd.Series(avg_importance, index=X_train.columns).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(13, 6))
feature_importance2.plot.bar(ax=ax);
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Most Important Features')
plt.rc('xtick', labelsize=20)
plt.rc('ytick', labelsize=20)
plt.show()
In this part, we want to improve the accuracy of our classifiers. Since we have already done hyperparameter selection, we could try to use other methods in order to improve the performance.
The first thing we want to focus on is decreasing the number of features. Extra features can degrade performance because they may "confuse" the model with irrelevant data that prevents it from learning the actual relationships. Reducing them lowers the dimensionality of the data, helps avoid overfitting, and improves the interpretability of the model. A random forest performs implicit feature selection, because it splits nodes on the most important variables, but other machine learning models do not. While we have only around 30 features here, which is not many, we would still like to check whether some of them are unimportant for the classification.
We are going to use various methods in order to deal with this, and check whether this will bring some improvements in our classifier.
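The effect of irrelevant features can be illustrated on synthetic data (this is only a hypothetical sketch, not our project's dataset): padding an informative design matrix with pure-noise columns typically lowers the cross-validated accuracy of logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5 informative features, no redundant ones
X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
# pad the design matrix with 500 columns of pure noise
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(200, 500))])

clf = LogisticRegression(max_iter=2000)
acc_clean = cross_val_score(clf, X, y, cv=5).mean()
acc_noisy = cross_val_score(clf, X_noisy, y, cv=5).mean()
print(f"clean: {acc_clean:.3f}, noisy: {acc_noisy:.3f}")
```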
First of all, let's write a piece of code that will allow us to compare random forest and logistic regression efficiently.
def performance_metric2(y_t, y_p1, y_p2):
    print("-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- ")
    cm1 = confusion_matrix(y_t, y_p1)
    disp1 = ConfusionMatrixDisplay(confusion_matrix = cm1)
    cm2 = confusion_matrix(y_t, y_p2)
    disp2 = ConfusionMatrixDisplay(confusion_matrix = cm2)
    accuracy1 = accuracy_score(y_t, y_p1)
    precision1 = precision_score(y_t, y_p1, average = 'micro')
    recall1 = recall_score(y_t, y_p1, average = 'micro')
    f11 = f1_score(y_t, y_p1, average = 'micro')
    accuracy2 = accuracy_score(y_t, y_p2)
    precision2 = precision_score(y_t, y_p2, average = 'micro')
    recall2 = recall_score(y_t, y_p2, average = 'micro')
    f12 = f1_score(y_t, y_p2, average = 'micro')
    print("General statistics:")
    print(" RANDOM FOREST: LOGISTIC REGRESSION:")
    print(" - Accuracy:", accuracy1.round(5), " ", accuracy2.round(5))
    print(" - Precision:", precision1.round(5), " ", precision2.round(5))
    print(" - Recall:", recall1.round(5), " ", recall2.round(5))
    print(" - F1 score:", f11.round(5), " ", f12.round(5))
    #average the per-class one-vs-rest AUC scores
    roc_auc_multiclass1 = roc_auc_score_multiclass(y_t, y_p1)
    roc_auc_multiclass2 = roc_auc_score_multiclass(y_t, y_p2)
    roc_auc_score1 = sum(roc_auc_multiclass1.values()) / len(roc_auc_multiclass1)
    roc_auc_score2 = sum(roc_auc_multiclass2.values()) / len(roc_auc_multiclass2)
    print(" - AUC Score:", roc_auc_score1.round(5), " ", roc_auc_score2.round(5))
    mc1 = matthews_corrcoef(y_t, y_p1)
    mc2 = matthews_corrcoef(y_t, y_p2)
    print(' - Matthews corrcoef:', mc1.round(5), " ", mc2.round(5))
    print("\nConfusion matrix:")
    fig, ax = plt.subplots(1, 2, figsize=(24, 8))
    fig.suptitle('Confusion Matrix')
    disp1.plot(ax = ax[0])
    ax[0].set_title("Random Forest")
    disp2.plot(ax = ax[1])
    ax[1].set_title("Logistic Regression")
    ax[0].grid(False)
    ax[1].grid(False)
    plt.show()
Embedded Methods are methods for features selection that are incorporated within the model training process itself. These methods aim to select the most relevant features by considering their impact on the model's performance during training. So, instead of performing feature selection as a separate preprocessing step, embedded methods evaluate the importance or relevance of features directly within the model's optimization process.
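scikit-learn packages this idea in `SelectFromModel`, which keeps the features whose importance in a fitted estimator exceeds a threshold. A minimal sketch on synthetic data (not our project's dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic data: 5 informative features out of 20
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)

# keep features whose impurity-based importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # half of the columns survive the median cut
```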
Here we are going to use the random forest: that is, we will use the feature importances obtained from our best-performing random forest to select subsets of features on which to train both Logistic Regression and Random Forest. For each subset size, we check the cross-validated score and see which performs best.
#plotting the performance as the number of features to consider varies
def plot_performance(estimator1, estimator2, feat_importance):
    results_rf = []
    results_lr = []
    for i in range(1, len(X_train.columns)):
        X_train_modified = X_train[feat_importance.nlargest(i).index].copy()
        results_rf.append(cross_val_score(estimator1, X_train_modified, y_train, scoring = 'accuracy', cv = 5))
        results_lr.append(cross_val_score(estimator2, X_train_modified, y_train, scoring = 'accuracy', cv = 5))
    fig, ax = plt.subplots(1, 2, figsize=(24, 8))
    fig.suptitle('Score per number of features')
    for i in range(0, len(X_train.columns) - 1):
        x = np.array(i + 1)
        y_1 = np.array(np.mean(results_rf[i]))
        e_1 = np.array(np.std(results_rf[i]))
        y_2 = np.array(np.mean(results_lr[i]))
        e_2 = np.array(np.std(results_lr[i]))
        ax[0].errorbar(x, y_1, e_1, linestyle='--', marker='o')
        ax[0].set_title("Random Forest")
        ax[1].errorbar(x, y_2, e_2, linestyle='--', marker='o')
        ax[1].set_title("Logistic Regression")
    plt.show()
#plotting how the models behave using the most important features of the random forest
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plot_performance(rf_best_classifier, lr_best_classifier, feature_importances)
In both cases, the best performance seems to happen when we have around 18 features.
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_importances.nlargest(18).index
Index(['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3',
'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16',
'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11',
'feature_8'],
dtype='object')
#performance of random forest and logistic regression
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
X_train_mod = X_train[feature_importances.nlargest(18).index].copy()
X_test_mod = X_test[feature_importances.nlargest(18).index].copy()
rf_best_classifier.fit(X_train_mod, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_mod)
lr_best_classifier.fit(X_train_mod, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_mod)
performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION--
General statistics:
 RANDOM FOREST: LOGISTIC REGRESSION:
 - Accuracy: 0.83846 0.79615
 - Precision: 0.83846 0.79615
 - Recall: 0.83846 0.79615
 - F1 score: 0.83846 0.79615
 - AUC Score: 0.89203 0.86342
 - Matthews corrcoef: 0.78549 0.72825
Confusion matrix:
Notice that we get an increase in performance in both cases. In particular, our best random forest model went from an accuracy of $0.827$ to $0.838$, while the best logistic regression model went from $0.781$ to $0.796$. Moreover, the other metrics have also improved overall. In general, random forest still performs better than logistic regression.
Filter methods are feature selection techniques that select features based on their intrinsic properties. In particular, these methods rely on statistical measures or scoring criteria to evaluate the relevance or importance of features, and then select a subset of features based on these scores. There are many measures we could use to decide which features to keep, such as information gain (mutual information), the chi-squared statistic, the ANOVA F-score, or the correlation with the target.
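In scikit-learn, this kind of filter can be expressed with `SelectKBest`, which scores every feature independently and keeps the `k` highest-scoring ones. A minimal sketch using mutual information on synthetic data (not our project's dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# synthetic data: only 4 of the 15 features carry signal
X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=42)

# score each feature with mutual information and keep the 8 best
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_top = selector.fit_transform(X, y)
print(X_top.shape)  # (300, 8)
```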
Let's try to see how our model behaves using information gain.
from sklearn.feature_selection import mutual_info_classif
importances = mutual_info_classif(X_train, y_train)
feat_importances = pd.Series(importances, X_train.columns).sort_values(ascending = False)
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
plot_performance(rf_best_classifier, lr_best_classifier, feat_importances)
Notice that in this case we have the best performing model when we have all the features, so it isn't really useful to remove features based on the information gain in this case.
Wrapper methods are feature selection techniques that select subsets of features by evaluating them through the actual model's performance. They search the space of possible feature subsets, assessing the quality of each by training and evaluating a classifier on it, typically following a greedy search. We could use Forward Feature Selection, Backward Elimination, or Recursive Feature Elimination.
Let's start with the Forward Feature Selection. It is an iterative approach that starts with an empty set of features and greedily adds the most important feature at each step based on the model's performance. We are going to perform it for Random Forest and Logistic Regression.
#train the selector of features
from sklearn.feature_selection import SequentialFeatureSelector
#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector = SequentialFeatureSelector(rf_best_classifier, direction = 'forward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)
#get the features
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)
#check the performance
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)
#LOGISTIC REGRESSION
selector = SequentialFeatureSelector(lr_best_classifier, direction = 'forward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)
#get the features
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)
#check the performance
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)
performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest: Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
'feature_11', 'feature_12', 'feature_15', 'feature_18', 'feature_19',
'feature_24', 'feature_25', 'feature_30', 'A', 'B', 'C'],
dtype='object')
Best features of logistic regression: Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
'feature_11', 'feature_12', 'feature_14', 'feature_15', 'feature_18',
'feature_21', 'feature_24', 'feature_25', 'A', 'B', 'C'],
dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION--
General statistics:
RANDOM FOREST: LOGISTIC REGRESSION:
- Accuracy: 0.81923 0.77692
- Precision: 0.81923 0.77692
- Recall: 0.81923 0.77692
- F1 score: 0.81923 0.77692
- AUC Score: 0.87947 0.85105
- Matthews corrcoef: 0.75979 0.70276
Confusion matrix:
In this case, we get a decrease in the overall performance of both models, mainly due to misclassification of classes 2 and 3. This might have happened because the selection removed features that were actually informative.
Let's also analyse the Backward Elimination method. It is similar to the previous one, but works in the opposite direction: it starts with all features and iteratively removes the least important one based on the model's performance. Let's repeat the same analysis as before.
#check the performance of Backward Elimination Method
from sklearn.feature_selection import SequentialFeatureSelector
#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector = SequentialFeatureSelector(rf_best_classifier, direction = 'backward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)
#LOGISTIC REGRESSION
selector = SequentialFeatureSelector(lr_best_classifier, direction = 'backward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)
performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest: Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
'feature_11', 'feature_15', 'feature_16', 'feature_18', 'feature_21',
'feature_24', 'feature_25', 'feature_29', 'feature_30', 'A', 'B'],
dtype='object')
Best features of logistic regression: Index(['feature_3', 'feature_4', 'feature_5', 'feature_7', 'feature_11',
'feature_12', 'feature_15', 'feature_16', 'feature_18', 'feature_21',
'feature_23', 'feature_24', 'feature_25', 'A', 'B', 'C'],
dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION--
General statistics:
RANDOM FOREST: LOGISTIC REGRESSION:
- Accuracy: 0.82308 0.77308
- Precision: 0.82308 0.77308
- Recall: 0.82308 0.77308
- F1 score: 0.82308 0.77308
- AUC Score: 0.88172 0.84831
- Matthews corrcoef: 0.76448 0.69761
Confusion matrix:
In this case we still get worse performance from both models, although the drop is smaller than before.
We could also use Recursive Feature Elimination (RFE). Like Backward Elimination, it starts with all features and iteratively removes the least important one(s); the difference is that RFE ranks features using the fitted model's importances (coefficients or impurity-based importances) rather than re-evaluating every candidate subset.
from sklearn.feature_selection import RFECV
#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector = RFECV(rf_best_classifier, cv = 5, scoring = 'accuracy', n_jobs = -1)
selector = selector.fit(X_train, y_train)
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)
#LOGISTIC REGRESSION
selector = RFECV(lr_best_classifier, cv = 5, scoring = 'accuracy', n_jobs = -1)
selector = selector.fit(X_train, y_train)
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)
performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest: Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
'feature_10', 'feature_11', 'feature_12', 'feature_15', 'feature_16',
'feature_18', 'feature_21', 'feature_24', 'feature_25', 'feature_30',
'A', 'B', 'C'],
dtype='object')
Best features of logistic regression: Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5',
'feature_7', 'feature_9', 'feature_11', 'feature_15', 'feature_16',
'feature_18', 'feature_21', 'feature_23', 'feature_24', 'feature_25',
'feature_26', 'feature_30', 'A', 'B', 'C'],
dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION--
General statistics:
RANDOM FOREST: LOGISTIC REGRESSION:
- Accuracy: 0.82692 0.78846
- Precision: 0.82692 0.78846
- Recall: 0.82692 0.78846
- F1 score: 0.82692 0.78846
- AUC Score: 0.88421 0.8583
- Matthews corrcoef: 0.76987 0.71787
Confusion matrix:
In this case, we don't get an improvement for Random Forest, while Logistic Regression improves by about $0.007$ (from $0.781$ to $0.788$).
Now we want to try to reduce the dimensionality of the data using PCA and check how the models behave.
We are going to apply PCA for both models, varying the number of principal components at each step, and then see how the models perform.
#visualize the performance of the models with respect to different numbers of principal components
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
y_train = np.array(y_train)
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
def plot_performance2(estimator1, estimator2):
    results_rf = []
    results_lr = []
    for i in range(1, len(X_train.columns)):
        pca = PCA(n_components = i)
        pca.fit(X_train_scaled)
        X_train_scaled_pca = pca.transform(X_train_scaled)
        results_rf.append(cross_val_score(estimator1, X_train_scaled_pca, y_train, scoring = 'accuracy', cv = 5))
        results_lr.append(cross_val_score(estimator2, X_train_scaled_pca, y_train, scoring = 'accuracy', cv = 5))
    fig, ax = plt.subplots(1, 2, figsize=(24, 8))
    fig.suptitle('Score per number of principal components')
    for i in range(0, len(X_train.columns) - 1):
        x = np.array(i + 1)
        y_1 = np.array(np.mean(results_rf[i]))
        e_1 = np.array(np.std(results_rf[i]))
        y_2 = np.array(np.mean(results_lr[i]))
        e_2 = np.array(np.std(results_lr[i]))
        ax[0].errorbar(x, y_1, e_1, linestyle='--', marker='o')
        ax[0].set_title("Random Forest")
        ax[1].errorbar(x, y_2, e_2, linestyle='--', marker='o')
        ax[1].set_title("Logistic Regression")
    plt.show()
plot_performance2(rf_best_classifier, lr_best_classifier)
Notice that the best performance is obtained when we keep all principal components, so we cannot reduce the dimensionality of the dataframe this way. This can be explained by the fact that the variance of the data is spread across many principal components.
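This can be checked directly with `explained_variance_ratio_`. A sketch on near-isotropic synthetic data (hypothetical, standing in for our scaled training set), where, as in our case, no small set of components dominates:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 30))  # isotropic data: variance spread over all directions

# fit a full PCA and accumulate the fraction of variance explained
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# number of components needed to explain 90% of the variance
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components_90, "of", X.shape[1], "components needed for 90% variance")
```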
After performing this analysis, we clearly see that the model that performs best overall is Random Forest. This is in accordance with our expectations, since we saw from PCA that the data is not linearly separable in 2D or 3D. Moreover, we have seen that the classifiers mainly struggle to distinguish between classes 2 and 3.
In particular, the model that performs the best is the one of Random Forest, with parameters:
bootstrap : False
ccp_alpha : 0.0
class_weight : None
criterion : entropy
max_depth : 40
max_features : sqrt
max_leaf_nodes : None
max_samples : None
min_impurity_decrease : 0
min_samples_leaf : 2
min_samples_split : 2
min_weight_fraction_leaf : 0.0
n_estimators : 300
n_jobs : -1
oob_score : False
random_state : 42
verbose : 0
warm_start : True
and with reduced features:
['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3', 'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16', 'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11', 'feature_8']
Now we want to import the file containing the test features on which we will output the predictions of our best model.
test_df = pd.read_csv("./data/mldata_0003164501.TEST_FEATURES.csv", sep = ",")
test_df
| id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | ... | feature_22 | feature_23 | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | categorical_feature_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.216251 | 1.600011 | 3.124432 | 1.007526 | 1.248892 | 0.048878 | -1.063602 | 0.256748 | -0.659555 | ... | 0.962239 | -0.779614 | -1.260093 | 0.092691 | 2.274102 | 1.574947 | 1.261300 | -0.264117 | 4.560051 | B |
| 1 | 1 | -1.084906 | -1.087918 | -7.716479 | -1.935018 | 3.081061 | 0.365231 | 2.270539 | -0.493430 | 1.310316 | ... | 0.642725 | 0.948476 | 0.156880 | -0.949245 | -0.200635 | -1.980347 | 0.855991 | -0.438245 | 2.587744 | C |
| 2 | 2 | 1.106605 | -1.545707 | -2.163345 | -1.813406 | 0.203347 | 1.633511 | 0.791142 | 1.033637 | -0.030198 | ... | -0.486569 | -0.341142 | -1.352579 | 1.379190 | 1.703867 | -0.060112 | -0.283798 | 0.661002 | 1.840297 | C |
| 3 | 3 | -2.042224 | 0.823476 | -2.285369 | -2.270773 | -0.668559 | -1.578110 | -0.257186 | -0.827421 | -1.256929 | ... | -0.960816 | -1.070933 | 2.171056 | 0.204803 | 0.620946 | -2.036895 | -1.177620 | -0.785916 | 0.173098 | C |
| 4 | 4 | -0.581274 | 0.363198 | -3.570156 | -4.286992 | 5.044998 | -0.397267 | 4.118269 | 0.691859 | 0.523798 | ... | 1.105811 | -0.490820 | 0.952442 | -3.772214 | -1.089031 | 1.407776 | -0.136638 | 0.089052 | -0.307349 | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1295 | 1295 | 1.831470 | 0.953282 | -1.071169 | -2.489733 | 1.812463 | -0.374221 | -3.150657 | -2.100736 | -0.854273 | ... | -1.182196 | -0.744190 | 1.014653 | -2.621318 | 0.962818 | 0.312637 | 1.541494 | 0.898536 | -5.367114 | B |
| 1296 | 1296 | -0.063951 | -0.471322 | 3.985702 | -2.218718 | 0.870567 | -0.764831 | 0.873860 | 0.588270 | -2.135085 | ... | 1.069542 | 1.230174 | -0.155100 | -2.571712 | 0.679225 | -0.681569 | -1.413776 | 0.201078 | -3.400294 | A |
| 1297 | 1297 | 0.654051 | -1.157193 | -1.193945 | -4.021244 | -2.909455 | 0.086752 | 2.514698 | -1.439090 | -0.369860 | ... | 1.346860 | 1.038488 | -0.288692 | 4.636710 | 0.463763 | 0.847172 | -0.841730 | 1.457939 | 5.344950 | C |
| 1298 | 1298 | 0.759557 | -4.173832 | -0.045697 | 5.252229 | -0.855698 | -0.798322 | 2.746694 | -1.207565 | 2.051560 | ... | 0.223734 | -0.540067 | 2.701097 | -2.223971 | -2.755301 | -1.551487 | -0.050303 | 0.515680 | -5.663085 | A |
| 1299 | 1299 | 0.694421 | 2.003003 | 2.290706 | -0.327494 | -0.445400 | 0.827480 | 0.778321 | 0.237986 | 0.690612 | ... | -0.080420 | 0.679049 | -0.970881 | 1.244630 | -0.486578 | 1.107635 | -0.723676 | 1.403534 | -0.416362 | C |
1300 rows × 32 columns
Let's apply the transformations needed in order to apply the classifier algorithm.
test_df.set_index('id', drop = True, inplace = True)
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(test_df[['categorical_feature_1']]).toarray())
encoder_df.columns = ['A', 'B', 'C']
test_df = test_df.join(encoder_df)
test_df.drop('categorical_feature_1', axis = 1, inplace = True)
test_df
| feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_24 | feature_25 | feature_26 | feature_27 | feature_28 | feature_29 | feature_30 | A | B | C | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 0 | 0.216251 | 1.600011 | 3.124432 | 1.007526 | 1.248892 | 0.048878 | -1.063602 | 0.256748 | -0.659555 | -0.733519 | ... | -1.260093 | 0.092691 | 2.274102 | 1.574947 | 1.261300 | -0.264117 | 4.560051 | 0.0 | 1.0 | 0.0 |
| 1 | -1.084906 | -1.087918 | -7.716479 | -1.935018 | 3.081061 | 0.365231 | 2.270539 | -0.493430 | 1.310316 | -0.757843 | ... | 0.156880 | -0.949245 | -0.200635 | -1.980347 | 0.855991 | -0.438245 | 2.587744 | 0.0 | 0.0 | 1.0 |
| 2 | 1.106605 | -1.545707 | -2.163345 | -1.813406 | 0.203347 | 1.633511 | 0.791142 | 1.033637 | -0.030198 | 0.621888 | ... | -1.352579 | 1.379190 | 1.703867 | -0.060112 | -0.283798 | 0.661002 | 1.840297 | 0.0 | 0.0 | 1.0 |
| 3 | -2.042224 | 0.823476 | -2.285369 | -2.270773 | -0.668559 | -1.578110 | -0.257186 | -0.827421 | -1.256929 | 1.121218 | ... | 2.171056 | 0.204803 | 0.620946 | -2.036895 | -1.177620 | -0.785916 | 0.173098 | 0.0 | 0.0 | 1.0 |
| 4 | -0.581274 | 0.363198 | -3.570156 | -4.286992 | 5.044998 | -0.397267 | 4.118269 | 0.691859 | 0.523798 | -0.432748 | ... | 0.952442 | -3.772214 | -1.089031 | 1.407776 | -0.136638 | 0.089052 | -0.307349 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1295 | 1.831470 | 0.953282 | -1.071169 | -2.489733 | 1.812463 | -0.374221 | -3.150657 | -2.100736 | -0.854273 | -0.299772 | ... | 1.014653 | -2.621318 | 0.962818 | 0.312637 | 1.541494 | 0.898536 | -5.367114 | 0.0 | 1.0 | 0.0 |
| 1296 | -0.063951 | -0.471322 | 3.985702 | -2.218718 | 0.870567 | -0.764831 | 0.873860 | 0.588270 | -2.135085 | -1.167302 | ... | -0.155100 | -2.571712 | 0.679225 | -0.681569 | -1.413776 | 0.201078 | -3.400294 | 1.0 | 0.0 | 0.0 |
| 1297 | 0.654051 | -1.157193 | -1.193945 | -4.021244 | -2.909455 | 0.086752 | 2.514698 | -1.439090 | -0.369860 | -0.814236 | ... | -0.288692 | 4.636710 | 0.463763 | 0.847172 | -0.841730 | 1.457939 | 5.344950 | 0.0 | 0.0 | 1.0 |
| 1298 | 0.759557 | -4.173832 | -0.045697 | 5.252229 | -0.855698 | -0.798322 | 2.746694 | -1.207565 | 2.051560 | 0.536480 | ... | 2.701097 | -2.223971 | -2.755301 | -1.551487 | -0.050303 | 0.515680 | -5.663085 | 1.0 | 0.0 | 0.0 |
| 1299 | 0.694421 | 2.003003 | 2.290706 | -0.327494 | -0.445400 | 0.827480 | 0.778321 | 0.237986 | 0.690612 | -0.251945 | ... | -0.970881 | 1.244630 | -0.486578 | 1.107635 | -0.723676 | 1.403534 | -0.416362 | 0.0 | 0.0 | 1.0 |
1300 rows × 33 columns
Let's apply the classifier. Even though we performed hyperparameter selection only on X_train, we will now train our model on the whole dataset, in order to obtain a model that hopefully generalizes better to unseen data. Although this means we no longer have a clear estimate of the generalization error, we expect that adding X_test to the training set should not change the best hyperparameters: the data should come from the same distribution, and we have seen that the best-performing hyperparameters were not subject to drastic changes across the best models. Therefore, what we found to be optimal before should remain optimal now. As an additional check of robustness, we can compute the cross-validation score over the whole dataframe and see whether it is comparable to the one obtained before.
most_important_features = ['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3',
'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16',
'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11',
'feature_8']
final_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
print("cross validation score on X_train: ", cross_val_score(final_classifier, X_train, y_train, scoring = 'accuracy', cv = 5, n_jobs = -1).mean())
print("cross validation score on the whole dataframe: ", cross_val_score(final_classifier, df, df_label, scoring = 'accuracy', cv = 5, n_jobs = -1).mean())
cross validation score on X_train:  0.7788461538461539
cross validation score on the whole dataframe:  0.7753846153846153
Notice that the cross-validation score is similar to the one obtained before, so we expect the generalization performance to be comparable.
#fit the final classifier on the whole dataset and predict on the unseen test features
test_df_mod = test_df[most_important_features].copy()
df_mod = df[most_important_features].copy()
final_classifier.fit(df_mod, df_label)
y_pred_final = final_classifier.predict(test_df_mod)
y_pred_final
array([2, 0, 0, ..., 0, 1, 0], dtype=int64)
file_path = "./output_test.txt"
np.savetxt(file_path, y_pred_final, delimiter='\n', fmt="%d")